49 research outputs found

    Learning Schemas for Unordered XML

    Get PDF
    We consider unordered XML, where the relative order among siblings is ignored, and we investigate the problem of learning schemas from examples given by the user. We focus on the schema formalisms proposed in [10]: disjunctive multiplicity schemas (DMS) and its restriction, disjunction-free multiplicity schemas (MS). A learning algorithm takes as input a set of XML documents which must satisfy the schema (i.e., positive examples) and a set of XML documents which must not satisfy the schema (i.e., negative examples), and returns a schema consistent with the examples. We investigate a learning framework inspired by Gold [18], where a learning algorithm should be sound i.e., always return a schema consistent with the examples given by the user, and complete i.e., able to produce every schema with a sufficiently rich set of examples. Additionally, the algorithm should be efficient i.e., polynomial in the size of the input. We prove that the DMS are learnable from positive examples only, but they are not learnable when we also allow negative examples. Moreover, we show that the MS are learnable in the presence of positive examples only, and also in the presence of both positive and negative examples. Furthermore, for the learnable cases, the proposed learning algorithms return minimal schemas consistent with the examples.Comment: Proceedings of the 14th International Symposium on Database Programming Languages (DBPL 2013), August 30, 2013, Riva del Garda, Trento, Ital

    gMark: Schema-Driven Generation of Graphs and Queries

    Full text link
    Massive graph data sets are pervasive in contemporary application domains. Hence, graph database systems are becoming increasingly important. In the experimental study of these systems, it is vital that the research community has shared solutions for the generation of database instances and query workloads having predictable and controllable properties. In this paper, we present the design and engineering principles of gMark, a domain- and query language-independent graph instance and query workload generator. A core contribution of gMark is its ability to target and control the diversity of properties of both the generated instances and the generated workloads coupled to these instances. Further novelties include support for regular path queries, a fundamental graph query paradigm, and schema-driven selectivity estimation of queries, a key feature in controlling workload chokepoints. We illustrate the flexibility and practical usability of gMark by showcasing the framework's capabilities in generating high quality graphs and workloads, and its ability to encode user-defined schemas across a variety of application domains.Comment: Accepted in November 2016. URL: http://ieeexplore.ieee.org/document/7762945/. in IEEE Transactions on Knowledge and Data Engineering 201

    Interactive Join Query Inference with JIM

    Get PDF
    National audienceSpecifying join predicates may become a cumbersome task in many situations e.g., when the relations to be joined come from disparate data sources, when the values of the attributes carry little or no knowledge of metadata, or simply when the user is unfamiliar with querying formalisms. Such task is recurrent in many traditional data management applications, such as data integration, constraint inference, and database denormalization, but it is also becoming pivotal in novel crowdsourcing applications. We present Jim (Join Inference Machine), a system for interactive join specification tasks, where the user infers an n-ary join predicate by selecting tuples that are part of the join result via Boolean membership queries. The user can label tuples as positive or negative, while the system allows to identify and gray out the uninformative tuples i.e., those that do not add any information to the final learning goal. The tool also guides the user to reach her join inference goal with a minimal number of interactions

    Interactive Inference of Join Queries

    Get PDF
    We investigate the problem of inferring join queries from user interactions. The user is presented with a set of candidate tuples and is asked to label them as positive or negative depending on whether she would like the tuples as part of the join result or not. The goal is to quickly infer an arbitrary n-ary join predicate across two relations by keeping the number of user interactions as minimal as possible. We assume no prior knowledge of the integrity constraints between the involved relations. This kind of scenario occurs in several application settings, such as data integration, reverse engineering of database queries, and constraint inference. In such a scenario, the user has either a partial knowledge of the database schemas or the database instances are too big to be skimmed. We explore the search space by using a set of strategies that let us prune what we call “uninformative” tuples, and directly present to the user the informative ones i.e., those that lead to quickly find the goal query that the user has in mind. In this paper, we focus on the inference of joins with equality predicates and we show that for such joins deciding whether a tuple is uninformative can be done in polynomial time. Next, we propose several strategies for presenting tuples to the user in a given order that lets minimize the number of interactions. We show the efficiency and scalability of our approach through an experimental study on both benchmark and synthetic datasets. Finally, we prove that adding projection to our queries makes the problem intractable. 1

    Learning Join Queries from User Examples

    Get PDF
    International audienceWe investigate the problem of learning join queries from user examples. The user is presented with a set of candidate tuples and is asked to label them as positive or negative examples, depending on whether or not she would like the tuples as part of the join result. The goal is to quickly infer an arbitrary n-ary join predicate across an arbitrary number m of relations while keeping the number of user interactions as minimal as possible. We assume no prior knowledge of the integrity constraints across the involved relations. Inferring the join predicate across multiple relations when the referential constraints are unknown may occur in several applications, such as data integration, reverse engineering of database queries, and schema inference. In such scenarios, the number of tuples involved in the join is typically large. We introduce a set of strategies that let us inspect the search space and aggressively prune what we call uninformative tuples, and we directly present to the user the informative ones that is, those that allow the user to quickly find the goal query she has in mind. In this article, we focus on the inference of joins with equality predicates and also allow disjunctive join predicates and projection in the queries. We precisely characterize the frontier between tractability and intractability for the following problems of interest in these settings: consistency checking, learnability, and deciding the informativeness of a tuple. Next, we propose several strategies for presenting tuples to the user in a given order that allows minimization of the number of interactions. We show the efficiency of our approach through an experimental study on both benchmark and synthetic datasets

    Simple Schemas for Unordered XML

    Get PDF
    We consider unordered XML, where the relative order among siblings is ignored, and propose two simple yet practical schema formalisms: disjunctive multiplicity schemas (DMS), and its restriction, disjunction-free multiplicity schemas (MS). We investigate their computational properties and characterize the complexity of the following static analysis problems: schema satisfiability, membership of a tree to the language of a schema, schema containment, twig query satisfiability, implication, and containment in the presence of schema. Our research indicates that the proposed formalisms retain much of the expressiveness of DTDs without an increase in computational complexity. 1

    A Paradigm for Learning Queries on Big Data

    Get PDF
    International audienceSpecifying a database query using a formal query language is typically a challenging task for non-expert users. In the context of big data, this problem becomes even harder as it requires the users to deal with database instances of big sizes and hence difficult to visualize. Such instances usually lack a schema to help the users specify their queries, or have an incomplete schema as they come from disparate data sources. In this paper, we propose a novel paradigm for interactive learning of queries on big data, without assuming any knowledge of the database schema. The paradigm can be applied to different database models and a class of queries adequate to the database model. In particular, in this paper we present two instantiations that validated the proposed paradigm for learning relational join queries and for learning path queries on graph databases. Finally, we discuss the challenges of employing the paradigm for further data models and for learning cross-model schema mappings

    Schemas for Unordered XML on a DIME

    Get PDF
    We investigate schema languages for unordered XML having no relative order among siblings. First, we propose unordered regular expressions (UREs), essentially regular expressions with unordered concatenation instead of standard concatenation, that define languages of unordered words to model the allowed content of a node (i.e., collections of the labels of children). However, unrestricted UREs are computationally too expensive as we show the intractability of two fundamental decision problems for UREs: membership of an unordered word to the language of a URE and containment of two UREs. Consequently, we propose a practical and tractable restriction of UREs, disjunctive interval multiplicity expressions (DIMEs). Next, we employ DIMEs to define languages of unordered trees and propose two schema languages: disjunctive interval multiplicity schema (DIMS), and its restriction, disjunction-free interval multiplicity schema (IMS). We study the complexity of the following static analysis problems: schema satisfiability, membership of a tree to the language of a schema, schema containment, as well as twig query satisfiability, implication, and containment in the presence of schema. Finally, we study the expressive power of the proposed schema languages and compare them with yardstick languages of unordered trees (FO, MSO, and Presburger constraints) and DTDs under commutative closure. Our results show that the proposed schema languages are capable of expressing many practical languages of unordered trees and enjoy desirable computational properties.Comment: Theory of Computing System

    General Terms

    Get PDF
    Web applications store their data within various database models, such as relational, semi-structured, and graph data models to name a few. We study learning algorithms for queries for the above mentioned models. As a further goal, we aim to apply the results to learning cross-model database mappings, which can also be seen as queries across different schemas. Categories and Subject Descriptor

    Requêtes et Schémas Hétérogènes : Complexité et Apprentissage

    No full text
    Specifying a database query using a formal query language is typically a challenging task for non-expert users. In the context of big data, this problem becomes even harder because it requires the users to deal with database instances of large size and hence difficult to visualize. Such instances usually lack a schema to help the users specify their queries, or have an incomplete schema as they come from disparate data sources. In this thesis, we address the problem of query specification for non-expert users. We identify two possible approaches for tackling this problem: learning queries from examples and translating the data in a format that the user finds easier to query. Our contributions are aligned with these two complementary directions and span over three of the most popular data models: XML, relational, and graph. This thesis consists of two parts, dedicated to (i) schema definition and translation, and to (ii) learning schemas and queries. In the first part, we define schema formalisms for unordered XML and we analyze their computational properties; we also study the complexity of the data exchange problem in the setting of a relational source and a graph target database. In the second part, we investigate the problem of learning from examples the schemas for unordered XML proposed in the first part, as well as relational join queries and path queries on graph databases. The interactive scenario that we propose for these two classes of queries is immediately applicable to assisting non-expert users in the process of query specification.La spécification de requêtes est généralement une tâche difficile pour les utilisateurs non-experts. Le problème devient encore plus difficile quand les utilisateurs ont besoin d'interroger des bases de données de grande taille et donc difficiles à visualiser. Le schéma pourrait aider à cette spécification, mais celui-ci manque souvent ou est incomplet quand les données viennent de sources hétérogènes. Dans cette thèse, nous abordons le problème de la spécification de requêtes pour les utilisateurs non-experts. Nous identifions deux approches pour attaquer ce problème : apprendre les requêtes à partir d'exemples ou transformer les données dans un format plus facilement interrogeable par l'utilisateur. Nos contributions suivent ces deux directions et concernent trois modèles de données parmi les plus populaires : XML, relationnel et orienté graphe. Cette thèse comprend deux parties, consacrées à (i) la définition et la transformation de schémas, et (ii) l'apprentissage de schémas et de requêtes. Dans la première partie, nous définissons des formalismes de schémas pour les documents XML non-ordonnés et nous analysons leurs propriétés computationnelles; nous étudions également la complexité du problème d'échange de données entre une source relationnelle et une cible orientée graphe. Dans la deuxième partie, nous étudions le problème de l'apprentissage à partir d'exemples pour les schémas XML proposés dans la première partie, ainsi que pour les requêtes de jointures relationnelles et les requêtes de chemins sur les graphes. Nous proposons notamment un scénario interactif qui permet d'aider des utilisateurs non-experts à définir des requêtes dans ces deux classes
    corecore